Dealing with messy data

python
datacamp
feature engineering
machine learning
data wrangling
Author

kakamana

Published

March 19, 2023

Dealing with messy data

This chapter will introduce us to messy, incomplete data. In this lesson, we’ll learn how to find missing values in our data and explore multiple approaches to dealing with them. We’ll also use string manipulation to get rid of unwanted characters.

This Dealing with messy data is part of Datacamp course: Feature engineering for machine learning in Python

This is my learning experience of data science through DataCamp. These repository contributions are part of my learning journey through my graduate program masters of applied data sciences (MADS) at University Of Michigan, DeepLearning.AI, Coursera & DataCamp. You can find my similar articles & more stories at my medium & LinkedIn profile. I am available at kaggle & github blogs & github repos. Thank you for your motivation, support & valuable feedback.

These include projects, coursework & notebook which I learned through my data science journey. They are created for reproducible & future reference purpose only. All source code, slides or screenshot are intellactual property of respective content authors. If you find these contents beneficial, kindly consider learning subscription from DeepLearning.AI Subscription, Coursera, DataCamp

Code
import pandas as pd

Why do missing values exist?

How gaps in data occur
    * Data not being collected properly
    * Collection and management errors
    * Data intentionally being omitted
    * Could be created due to transformations of the data
Why we care?
    * Some models cannot work with missing data (Nulls/NaN)
    * Missing data may be a sign a wider data issue
    * Missing data can be a useful feature

How sparse is my data?

Most data sets contain missing values, often represented as NaN (Not a Number). If you are working with Pandas you can easily check how many missing values exist in each column.

Code
so_survey_df = pd.read_csv('dataset/Combined_DS_v10.csv')
so_survey_df.head()
SurveyDate FormalEducation ConvertedSalary Hobby Country StackOverflowJobsRecommend VersionControl Age Years Experience Gender RawSalary
0 2/28/18 20:20 Bachelor's degree (BA. BS. B.Eng.. etc.) NaN Yes South Africa NaN Git 21 13 Male NaN
1 6/28/18 13:26 Bachelor's degree (BA. BS. B.Eng.. etc.) 70841.0 Yes Sweeden 7.0 Git;Subversion 38 9 Male 70,841.00
2 6/6/18 3:37 Bachelor's degree (BA. BS. B.Eng.. etc.) NaN No Sweeden 8.0 Git 45 11 NaN NaN
3 5/9/18 1:06 Some college/university study without earning ... 21426.0 Yes Sweeden NaN Zip file back-ups 46 12 Male 21,426.00
4 4/12/18 22:41 Bachelor's degree (BA. BS. B.Eng.. etc.) 41671.0 Yes UK 8.0 Git 39 7 Male £41,671.00
Code
# Subset the DataFrame
sub_df = so_survey_df[['Age','Gender']]

# Print the number of non-missing values
print(sub_df.notnull().sum())
Age       999
Gender    693
dtype: int64

Finding the missing values

While having a summary of how much of your data is missing can be useful, often you will need to find the exact locations of these missing values

Code
sub_df.head(10)
Age Gender
0 21 Male
1 38 Male
2 45 NaN
3 46 Male
4 39 Male
5 39 Male
6 34 Male
7 24 Female
8 23 Male
9 36 NaN
Code
# Print the locations of the missing values
print(sub_df.head(10).isnull())
     Age  Gender
0  False   False
1  False   False
2  False    True
3  False   False
4  False   False
5  False   False
6  False   False
7  False   False
8  False   False
9  False    True
Code
# Print the locations of the non-missing values
print(sub_df.head(10).notnull())
    Age  Gender
0  True    True
1  True    True
2  True   False
3  True    True
4  True    True
5  True    True
6  True    True
7  True    True
8  True    True
9  True   False

Dealing with missing values (I)

Issues with deletion
    It deletes vaild data points
    Relies on randomness
    Reduces information

Listwise deletion

The simplest way to deal with missing values in your dataset is to remove them, also called ‘listwise deletion’.

Sometimes you’ll want to remove all missing values in your data, and sometimes you’ll just want to remove a column if too many values are missing in it.

Code
print(so_survey_df.shape)
(999, 11)
Code
# Create a new DataFrame dropping all incomplete rows
no_missing_values_rows = so_survey_df.dropna()

# Print the shape of the new DataFrame
print(no_missing_values_rows.shape)
(264, 11)
Code
# Create a new DataFrame dropping all columns with incomplete rows
no_missing_values_cols = so_survey_df.dropna(how='any', axis=1)

# Print the shape of the new DataFrame
print(no_missing_values_cols.shape)
(999, 7)
Code
# Drop all rows where Gender is missing
no_gender = so_survey_df.dropna(subset=['Gender'])

# Print the shape of the new DataFrame
print(no_gender.shape)
(693, 11)

Replacing missing values with constants

In many cases, removing all missing data may be the right thing to do, but you’re omitting a lot of information.

It’s possible to find categorical columns with missing values that are valid information in themselves, like someone refusing to answer a survey question. Fill in all missing values with a new category entirely, like ‘No response given’.

Code
# Print the count of occurrences
print(so_survey_df['Gender'].value_counts())
Male                                                                         632
Female                                                                        53
Female;Male                                                                    2
Transgender                                                                    2
Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming      1
Male;Non-binary. genderqueer. or gender non-conforming                         1
Non-binary. genderqueer. or gender non-conforming                              1
Female;Transgender                                                             1
Name: Gender, dtype: int64
Code
# Replace missing values
so_survey_df['Gender'].fillna('Not Given', inplace=True)

# Print the count of each value
print(so_survey_df['Gender'].value_counts())
Male                                                                         632
Not Given                                                                    306
Female                                                                        53
Female;Male                                                                    2
Transgender                                                                    2
Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming      1
Male;Non-binary. genderqueer. or gender non-conforming                         1
Non-binary. genderqueer. or gender non-conforming                              1
Female;Transgender                                                             1
Name: Gender, dtype: int64

Dealing with missing values (II)

Deleting missing values
    Can't delete rows with missing values in the test set
What else can you do?
    Categorical columns: Replace missing values with the most common occurring value or with a string that flags missing values such as 'None'
    Numerical columns: Replace missing values with a suitable value

Filling continuous missing values

In the last lesson, you dealt with different methods of removing data missing values and filling in missing values with a fixed string. These approaches are valid in many cases, particularly when dealing with categorical columns but have limited use when working with continuous values. In these cases, it may be most valid to fill the missing values in the column with a value calculated from the entries present in the column.

Code
so_survey_df['StackOverflowJobsRecommend'].head()
0    NaN
1    7.0
2    8.0
3    NaN
4    8.0
Name: StackOverflowJobsRecommend, dtype: float64
Code
# Fill missing values with the mean
so_survey_df['StackOverflowJobsRecommend'].fillna(so_survey_df['StackOverflowJobsRecommend'].mean(), inplace=True)

# Print the first five rows of StackOverflowJobsRecommend column
print(so_survey_df['StackOverflowJobsRecommend'].head())
0    7.061602
1    7.000000
2    8.000000
3    7.061602
4    8.000000
Name: StackOverflowJobsRecommend, dtype: float64

Fill missing values with the mean

so_survey_df[‘StackOverflowJobsRecommend’].fillna(so_survey_df[‘StackOverflowJobsRecommend’].mean(), inplace=True)

Round the StackOverflowJobsRecommend values

so_survey_df[‘StackOverflowJobsRecommend’] = round(so_survey_df[‘StackOverflowJobsRecommend’])